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Penalized methods are commonly used for selecting variables and fitting 
high-dimensional data. It is well known that the LASSO is biased and thus 
cannot attain the estimation efficiency of the oracle selector. Recent studies 
[6, 7, 8, 10, 11] showed that due to the interference of the bias, the LASSO 
requires quite strong conditions for consistent variable selection. Since the 
i\ penalty has the smallest bias among all convex penalty functions with se- 
lection features, these studies naturally draw our attention to methodologies 
based on concave penalties, or equivalently, nonconcave penalized likelihood. 

Frank and Friedman [5] considered the l a penalty for general a > 0, which 
is strictly concave for a < 1. Their main interest was to use a as a hyperpa- 
rameter to "bridge" between the subset selection with a = and the ridge 
regression with a = 2. Important progresses were made by Fan and Li [3], 
who advocated the unbiasedness and continuity as essential for variable se- 
lectors and carefully developed the SCAD method. In the theoretical front, 
Fan and Peng [4] proved that the SCAD has the oracle property when the 
number of variables is a certain fraction power of the sample size. However, 
nonconcave penalized likelihood methods are still commonly viewed as com- 
putationally limited and poorly understood, especially when the number of 
variables exceeds the number of data points. 

Zou and Li made two significant contributions by addressing both the 
computational and efficiency issues. They developed a fast iterative algo- 
rithm for minimizing nonconcave penalized likelihood and proposed a simple 
one-step method with the oracle property for full rank designs. We congrat- 
ulate them on this important work. In what follows we relate our work to 
theirs through discussions on continuity, computational strategies, selection 
consistency and oracle efficiency. 
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1. The MC+ method. Consider penalized squared loss of the form 

(i-i) ^l|y-x/3|| 2 + X>(|^A), 

3=1 

where y G M. n is a response vector, X = (xi, . . . , x p ) is a design matrix with 
p covariate vectors Xj, and p(t; A) is a penalty function indexed by A > 0. 

In [9] we introduced and studied the MC+, which offers fast, contin- 
uous, nearly unbiased and accurate penalized variable selection in high- 
dimensional linear regression, including the case of p 3> n. The MC+ has 
two elements: a minimax concave penalty (MCP) and a penalized linear un- 
biased selection (PLUS) algorithm. 

The MCP, given by 

(1.2) p(t;\) = \J*(l-^j dx 

with a regularization parameter 7, minimizes the maximum concavity 

(1.3) K(p;X)=max{-p(t;X)}, p{t) = (8 / dtf p(t; A) , 

among all penalty functions satisfying the constraints 

(1.4) p(*;A) = W> 7 A, p(0+;A) = A, 

where p(t; A) = (d/dt)p(t; A). 

Let p m (t) denote a quadratic spline in [0, 00) with m knots throughout this 
discussion, including as a knot. The PLUS computes potentially multiple 
solutions of the Karush-Kuhn-Tucker-type conditions 



(1.5) 



X ;.(y - X/3(A))/n = sgn(/3j(A))/j(|/3j(A)|; A), ft (A) ± 0, 
>;.(y-X3(A))/n| <A, ft(A) = 0, 



for the possibly nonconvex (1.1), with a penalty of the form p(t; A) = X 2 p m (t/\). 
This includes the l\ penalty with m = 1, the MCP with m = 2 and the SCAD 
with m = 3. The output of the PLUS forms a certain main branch of the 
graph of the entire solution set of (1.5). The main branch is a continuous 
piecewise linear path encompassing from the origin to an "optimal fit" for 
zero penalty. Other branches of the solution graph form separate loops. The 
PLUS computes one line segment in the main branch in each step and its 
computational cost is the same as the LARS per step. For m = 1, the PLUS 
becomes the LARS. Moreover, as 7 — ► 00, the MC+ converges to the LASSO 
for all datasets. 
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2. Some simulation results. We present some simulation results to demon- 
strate the performance of the LASSO, MC+ and SCAD. Our experiment 
involves three settings identified by (n,p,d°, /3»), where 

(2.1) d° = #{j:P j7 tO}, /^minl&l, 

characterize the sparsity of the unknown (3. The first two settings are the 
same as Example 1 of Zou and Li with (3 = (3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0)' and 
n = 50 and 100 respectively. In the third setting, we choose (n,p,d°, (3*) = 
(100, 300, 15, 1). We divide the 300 components of the (3 vector into 25 con- 
tinuous blocks, assign the smaller (3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0)'/1.5 to 5 ran- 
dom blocks and set entries in other blocks to 0. This makes variable selec- 
tion harder in the third setting. We generate 1000 independent replications 
of (X,y) in each setting, where y = X/3 + e, e ~ N(Q,T) and X has i.i.d. 
Gaussian rows with correlation 2~' fe ~ J ' between the jth and kth. entries in 
the same row. For all three procedures, the regularization parameter 7 is 
defined as the smallest one for the first part of (1.4) to hold, or equivalently 
7 = inf{t>0:p m (t) = 0}. 

We report our results for A = y/2(\ogp) /n in Table 1. The meaning of 
MRME is as in Zou and Li. All other entries are averages of the 1000 repli- 
cations, with ME being the prediction risk as in Zou and Li, MSR being 
1 1 X/3 — X/3 1| 2 as the squared loss for the mean vector, CS being I{A = A°} 
as the indicator for correct selection, TM = \A\ A°\ + \A° \ A\ as the total 
miss in selection, and the number of PLUS steps k as a measurement of 



Table 1 

Performance of LASSO, MC+ and SCAD 



Method(7) 


MRME 


ME 


MSE 


CS 


TM 


k 






ti = 50, p=12, d° = 3, /?* = 


1.5 






LASSO(oo) 


0.6129 


0.2192 


9.1740 


0.481 


0.687 


4.733 


MC+(3.7) 


0.1957 


0.0753 


3.3993 


0.878 


0.128 


7.317 


SCAD(3.7) 


0.1847 


0.0689 


3.1224 


0.878 


0.135 


10.843 






n = 100, p = 


12, d° = 3, P, = 


1.5 






LASSO(oo) 


0.6794 


0.1007 


9.2221 


0.512 


0.636 


4.650 


MC+(3.7) 


0.2264 


0.0361 


3.4795 


0.868 


0.139 


7.189 


SCAD(3.7) 


0.2157 


0.0327 


3.1617 


0.868 


0.140 


10.532 






n = 100, p = 300, d° = 15, (3, 


= 1 






LASSO(oo) 




2.5849 


148.2765 


0.000 


8.271 


25.227 


MC+(2.7) 




0.2689 


20.0116 


0.859 


0.191 


41.500 


SCAD(3.7) 




1.3174 


80.4836 


0.322 


1.686 


59.657 


MC+(2.5) 




0.2373 


17.9240 


0.870 


0.178 


44.156 


SCAD(2.5) 




0.5510 


37.6191 


0.787 


0.349 


115.322 
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Fig. 1. The average of 1000 simulated ME of the MC+ (solid) and SCAD (dashed) as 
functions of A/>/(logp)/ra for (n,p,d° , (3, ,7) = (50,12,3,1.5,3.7), (100,12,3,1.5,3.7) and 
(100,300,15,1,2.5), from the left, with dotted vertical at \ — -\/2(logp)/n. 

computational complexity. Here, 

(2.2) v4° = {j:/^0} and l={j:^0} 

are the oracle and selected models respectively, and the MRME is undefined 
for p> n. We plot the average of ME and TM against A in Figures 1 and 2. 
From these results, we observe that the SCAD and MC+ perform similarly 
in the first two settings, while the MC+ has much stronger performance in 
the difficult third setting. The LASSO performs poorly in all three settings. 
This certainly does not represent a thorough simulation comparison of the 
three methods, but it is consistent with our other simulation experiments 
[9]. 

We do not have a definite explanation for the difference in the performance 
of the SCAD between our simulation and that of Zou and Li in the first two 
settings, but we offer the following observations. It is clear from Figure 1 
that the prediction risk of the SCAD is quite flat in a wide region down to 
A/-\/ (logp) /n = 1 at least, so that choosing A by CV may cause over fit in 
view of Figure 2 and the results of Zou and Li. Moreover, 5-fold CV is not 
designed to choose A accurately unless the dependence of the "optimal" A on 
n is carefully adjusted, for example, with the factor n -1 / 2 . For example, the 
penalized loss (1.1) with p(t;X) = X\t\ is equivalent to ||y — X/3|| 2 /2 + A||/3||i 
for the LASSO with the scale change A — > nA, but the best penalty levels in 
5-fold CV in these two formulations have effectively 20% difference without 
adjustment for n. Of course, this second problem with CV diminishes if we 
increase the number of CV folds. 

Consider the first two settings in our experiment. In our theorems, a 
basic upper bound is 2{p — d°)Q{—\y/n) for the probability of selecting 
some variables with (3j =0, which amounts to 0.232 for A = y / 2(logp) /n. 
This roughly explains the proportion of incorrect selection 1— CS for the 
MC+ and SCAD in Table 1. On the other hand, the unbiasedness requires 
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/?* = 1.5 > 7A with 7 = 3.7, which allows up to A/ \J (log p)/n = 1.82 > \/2 
and 2.57 > y/2 respectively for n = 50 and 100. Thus, in such cases with 
a strong signal, the SCAD and MC+ perform better with the larger A as 
shown in Figures 1 and 2. 

3. Continuity and unbiasedness. Let A C {1, . . . ,p} represent the model 
with covariate vectors x,-, j £ A. Define 

(3.1) 3 A (A) = argminj -L||y - X A /3J 2 + £ p(|/%|; A)l, 

where (3 A = (f3j,j £ A)' and = (xj,j G A). In [9] we prove that, for 
rank(X J 4) = \A\ and fixed A, (3 A (X) is continuous in y iff —pit) < ^^(X^X^/n) 
almost everywhere in t > 0, where c m j n (M) is the smallest eigenvalue of M. 
Thus, the global convexity condition n(p; A) < c m i n (X'X/re) characterizes the 
global continuity of the global minimizer of (1.1), in the sense of sufficiency 
and near necessity. 

For p > n and sparse f3 with d° = {j : (3j 7^ 0} < n, we look for sparse 
(local) minimizers of (1.1), so that we care about the sparse continuity of 
solutions of (1.5) in the sense of the continuity of (3.1) in y for all \A\ < d* , 
for certain rank d* > d°. This sparse continuity property is characterized by 
the following sparse convexity condition: 

(3.2) K(p;X)< min c min (X.' A X A /n). 

\A\=d* 

Under this condition and subject to #{j : (3j 7^ 0} < d*/2, the solution of 
(1.5) is the unique local minimizer of (1.1) given A and y and is continuous 
in y. The global and sparse convexity conditions are not properties of the 
penalty alone, but they provide the rationale for the use of (1.3) as the 
measurement of the concavity of the penalty. 




12 4 12 4 12 4 



Fig. 2. The average of 1000 simulated TM of the MC+ (solid) and SCAD (dashed) as 
functions of X/y/ (logp)/n for the experiments in Figure 1. 
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The penalty function has selection features if p(0+; A) > 0. We use the sec- 
ond part of (1.4) to standardize the index A so that it has the interpretation 
as the threshold for (3j for standardized designs with ||xj|| 2 /ra = 1. Fan and 
Li [3] pointed out that penalized estimators are (nearly) unbiased beyond 
a second threshold 7A if the first part of (1.4) holds. Thus, the constraints 
in (1.4) are natural for unbiased selection. Given (1.4), the MCP provides 
the sparse continuity for the largest possible rank d* in (3.2), as it mini- 
mizes k(p;X). Conversely, given a fixed value of «(p;A), the MCP provides 
the smallest second threshold 7A for the unbiasedness, with 7 = l/n(p;\). 
Thus, the MCP ensures the continuity and unbiasedness of sparse local min- 
imizers of (1.1) to the greatest extent for general design matrices X. This 
analysis provide a new point of view since it is clearly different from previ- 
ous characterizations of penalty functions based on their performance with 
orthonormal. 

4. The PLUS algorithm. Consider penalties of the form p(t; A) = X 2 p m (t/X). 
Let z* = X'y/n, Xj = x^X/n and r = 1/A. With the scale change z = tz* 
and b = r/3, (1.5) becomes 

/a -1 \ f H - Xjb = sgn{bj)p m (\bj\), bj / 0, 

1 > l>j-Xj-b| <1 = A™(0+), bj = 0. 

The solutions z © b of (4.1) form a smooth p-dimensional surface S in ]R 2p 
composed of (2m + l) p p-parallelepipeds. Given the data z*, the rescaled 
solution set of (1.5) is the intersection of this p-surface S and the half p- 
subspace {(tz*) ©b : r > 0}. Almost everywhere in X and 7, this intersection 
is composed of a main branch and separate loops. The main branch is piece- 
wise linear, begins with b = 0, and ends with a solution of perfect fit for 
z* [also for y if rank(X) = n]. The PLUS algorithm begins with b = and 
tracks the main branch of the solution set of (1.5) by finding in its kth 
step a second endpoint of its fcth line segment. Since each step of the PLUS 
algorithm travels through one distinct parallelepiped in the surface S, the 
algorithm stops in finitely many steps. Our computational strategy for this 
special nonconvex minimization problem is different from the algorithms of 
Zuo and Li and other existing iterative ones which converge to a single local 
minimizer for fixed penalty levels A. 

Under the global convexity condition, the PLUS finds the unique solution 
of (1.5) for all A as the global minimizer of (1.1). Otherwise, the value of 
r = 1/A may not be monotone in the PLUS path, so that multiple local 
minimizers of (1.1) are obtained. We choose the sparsest solution within the 
PLUS path for a given penalty level. For variable selection purposes, we 
typically use the universal penalty level a^2(logp)/n or a slightly larger A 
for large p and standardized designs with ||xj|| 2 /?i = 1. We estimate a based 
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on certain mean residual squares with a theoretically justified formula for 
degrees of freedom. 

Since the PLUS path has to make a turn whenever one of the bj hits a 
knot of p m , the LASSO is the simplest to compute with m = 1, the MC+ 
is the next for m = 2, and then the SCAD for m = 3, so on and so forth. 
The computational complexity is also regulated by the ways the surface S 
folds as a p- vector valued function of z. The orientation of the surface is de- 
termined by the eigenvalues of X^X^/ra + diag(p(6j; A), j £ A) in individual 
p-parallelepipeds with A = {j :bj / 0}. Thus, the complexity of the PLUS 
algorithm is also controlled by the maximum concavity A). For example, 
when k(p; A) = 1/(7 — 1) for the SCAD increases from 1/2.7 to 1/1.5, the 
number of required PLUS steps nearly doubles as reported in Table 1. 

5. Selection consistency and oracle estimation efficiency. A variable se- 
lector is consistent if P{A° = A} — > 1 with the A° and A in (2.2). Under 
selection consistency, efficient estimation after selection yields oracle effi- 
ciency under simpler regularity conditions for the d°-dimensional estimation 
problem as in Theorem 4 of Zou and Li. In this sense, selection consistency 
implies oracle estimation efficiency 

In [9], we prove that the PLUS solution of (1.5) is selection consistent 
under mild conditions on j3 and X in the linear model 

(5.1) y = X/3 + £, e~iV(0,a 2 I). 

Here we state a simplified version of the theorem. Let X^ be as in (3.1). 
The design matrix X satisfies the sparse Riesz condition if 

(5.2) c*< ||X A b|| 2 /n<c* V||b|| = l,\A\ <d*, 

that is, all the eigenvalues of X'^X^/n have to lie inside [c*,c*] as long 
as I vl| < d* . The connection of (5.2) to the Riesz condition on norms was 
discussed in [10], while sufficient conditions for (5.2) for random matrices 
were provided in [1, 10], including d* = e an for fixed positive {c*,c*, a}. The 
quantities c* and c* have been considered as sparse minimum and maximum 
eigenvalues in [2, 7]. 

Theorem 1. Let (X,y) be as in (5.1) with ||xj|| 2 /n = 1 and be the 
oracle LSE with {j:(3° / 0} = A° and G A°)' = (X. f Ao X A o)~ 1 X' Ao y , 

where A° is as in (2.2). Let d° and /3* be as in (2.1). Let p(t;X) = X 2 p m (t/X) 
be a quadratic spline penalty function satisfying (1.4). Suppose (5.2) holds 
with K(p m ;l) < c*. Then, there exist constants Mi and Mi depending on 
{c*,c*} and p m only, such that for 



(5.3) ft >Mi<7^/(l + logp)/ra, M 2 d° + l<d\ 
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and A = o\]1 (log pjjn, the PLUS solution (3 = /3(A) of (1.5) satisfies 
(5.4) P{A / A°} < P{/3 / 3° or sgn(3) / sgn(/3)} as p = p n ^ oo. 

The selection consistency in Theorem 1 implies that the PLUS solution 
(3 achieves the oracle estimation efficiency of j3 . Here sgn(/3) means the 
application of the sign function per component with the convention sgn(O) 
0. We note that {p, d* , d°, /3*} are all allowed to depend on n in Theorem 1. 
A more general version of Theorem 1 in [9] also allows (c*,c*) dependent on 
n, larger A or bounded p. An interesting aspect of this result is its validity 
in cases where p is as large as e an for a fixed small a > 0. The one step 
estimator of Zou and Li can be viewed as adaptive LASSO [12]. As such, it 
requires an initial estimator which essentially separates the zero and nonzero 
(3j for a certain unspecified threshold. 
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